and maintenance benefits that they have documented. Other efforts in the field of power
techniques for the test and evaluation community is assessed. The authors will report on several
aspects of their experience with GPGPUs: programmability, performance of codes implemented
in  several  areas  of  computational  science,  and  the  compute  power  per  unit  of  electrical 
consumption.  An  overview  of  code  design  and  implementation  approaches  is  discussed. 

Key words: Code development; computing cost; computing power per watt; efficient
low-power microprocessor (ELM); energy conservation; IBM’s Blue Gene; military
experimentation; modeling; simulation fidelity; training; William Dally.


It  is  commonly  held  that  test  and  evaluation 
(T&E)  is  one  of  the  most  critical  steps  in  the 
development  of  virtually  all  defense  systems 
(Fox  et  al.  2003).  It  is  the  central  means  of 
making  sure  that  new  systems  will  reliably 
perform  their  intended  functions  in  their  intended 
environment,  often  combat.  T&E  of  current  systems  is 
an  elaborate  and  time-consuming  process  that  reflects 
both  the  intricacies  of  the  object  of  the  test  and  the 
range  of  equipment,  personnel,  and  environments 
required.  Many  argue  that  this  process  consumes  far 
too  much  of  the  time  that  it  takes  to  put  new  systems 
into  the  hands  of  the  warfighters  and  uses  way  too 
many  resources  without  much  obvious  benefit  for  those 
in  combat. 

One approach to ameliorating these costs and delays
is the increased use of computer simulations, ranging
from agent-based models of battlespaces to Mechanical
Computer Aided Engineering (MCAE) analyses of
hardware to esoteric simulations using computational
fluid dynamics to assess everything from new
airframes to dispersion of chemical and biological
agents.


Computing  costs  are  significant  as  well.  These  costs 
are  not  only  the  computer  purchase  price,  be  it  a  small 
workstation  or  time  on  High  Performance  Computers 
(HPC).  They  must  include  the  costs  of  training, 
programming,  maintaining,  validating,  and  supporting 
extensive  code  bases  (Kepner  2004).  These  questions 
are  even  more  urgent  because  increasing  emphasis  in 
T&E  concerns  the  expenditures  of  money  and  time  in 
the  development  process.  Efficiency  is  critical  when 
cost  overruns  and  schedule  delays  are  deleterious  and 
costly  (Fox  et  al.  2004). 

One potential approach to reducing costs, time-to-
roll-out, and physical danger, all the while improving
validity, transparency, and utility, is to adopt the
strategy of heterogeneous computing. Heterogeneous
computing is the use of a variety of different types of
computational units to aid the central processing unit
(CPU), such as accelerators like General Purpose
Graphics Processing Units (GPGPUs), field programmable
gate arrays, and digital signal processors. There
is a growing body of evidence on the use of these
devices, some of it created by the authors in their work
on large-scale battlespace simulations at the U.S. Joint
Forces Command (JFCOM) and its Joint Concept
Development and Experimentation Directorate (J9).

Joint  SemiAutomated  Forces  (JSAF) 

One program in use at JFCOM is the JSAF code. JSAF
is loaded onto a network of processors in either
workstations or Linux clusters, which communicate via
a local or wide area network. Communication is
implemented with the High Level Architecture and a custom
version of the runtime infrastructure software, called
RTI-s. A run is implemented as a federation of simulators or
clients, and multiple clients, in addition to JSAF, are
typically included in a simulation.

As is common in the T&E community, operational
imperatives  drive  experimental  designs  that  require 
even  further  expansion  of  simulation  code  capabilities. 
These  needs  include  some  of  the  following: 

•  more  background  entities, 

•  more  complex  behaviors, 

•  larger  geographic  area, 

•  multiple  resolution  terrain,  and 

•  more  complex  environments. 

The  energy  efficiency  issues  addressed  here  are  not 
new  ones.  The  lack  of  energy  resources  and  the 
inability  to  adequately  conserve  existing  power  reserves 
can  arguably  be  advanced  as  one  of  the  reasons  for  the 
loss  of  World  War  II  by  both  major  Axis  powers. 

In T&E settings, the need for power conservation is
still  paramount,  mainly  for  cost,  maintenance,  and 
habitability  reasons.  These  may  vary  by  region,  e.g., 
power  is  on  the  order  of  three  times  as  expensive  on 
Maui  as  it  is  in  Maryland,  and  by  installation,  e.g.,  size 
and  temperature  constraints  differ  between  a  high 
performance  computing  center  and  a  test  aircraft 
cockpit.  Nevertheless,  all  of  the  previously  mentioned 
parameters  are  important,  critical,  or  vital,  as  the  case 
may  be. 

While  most  equipment  suffers  from  high  heat, 
electronics  are  especially  sensitive.  The  microcircuitry 
now  employed  in  every  phase  of  computing  is  prone  to 
energy  constraints,  the  principal  culprit  being  the  need 
to  transfer  heat  away  from  the  sensitive  circuits  that  are 
generating  their  own  heat.  While  calling  attention  to 
this  concern,  it  is  not  the  intent  of  this  article  to  focus 
on  heat  dissipation  mitigation  techniques. 

This  article  investigates  innovative  and  effective 
ways  to  accomplish  the  same  amount  of  computation 
while  using  significantly  less  total  energy.  The 
technique  studied  by  the  authors  is  to  use  GPGPUs 
to  effectively  handle  computationally  intensive  activity 
“spikes.”  The  authors  report  on  three  specific  aspects  of 
their  use  of  GPGPUs: 


•  code  drafting  and  development  hurdles  and 
opportunities, 

•  codes  modified  in  several  areas  of  computational 
science, 

• a wide range of software results in floating point
operations per second (FLOPS) per watt parameters
in various hardware configurations.

An  introductory  synopsis  of  algorithmic  design  and 
implementation strategies should allow the T&E users
to  conceptualize  the  applicability  of  this  technique  to 
their  own  situations.  To  assist  in  this  analysis,  we 
discuss  and  display  an  actual  working  code  segment 
along  with  the  design  rationale  behind  it.  Further, 
because  such  new  techniques  cannot  be  implemented 
willy-nilly,  the  authors  feel  that  their  experience  in 
training  other  Department  of  Defense  (DoD)  users  to 
implement  the  approach  will  assist  program  managers 
in  scoping  and  justifying  training  requirements. 

GPGPUs  as  computer  accelerators 

Methodology employed in simulation

To better analyze potential T&E use, we set forth
the method implemented by this team for forces
modeling and simulation. We use existing DoD
simulation codes running on advanced Linux clusters
operated by JFCOM. The previous J9 clusters were on
Maui and at Wright Patterson Air Force Base in Ohio,
but the new cluster enhanced with 64-bit CPUs and
NVIDIA 8800 GPUs was in Suffolk at JFCOM
(Lucas et al. 2007). In addition to the benefits derived
in force-on-force modeling, the T&E community at
large could benefit from the acceleration applied in
other arenas, such as

•  physics-based  phenomenology, 

•  CFD  plume  dispersion, 

•  computational  atmospheric  chemistry, 

•  data  analysis. 

GPGPU  experiments  were  first  conducted  on  a 
more  manageable  code  set  to  ease  the  programming 
burden  and  hasten  the  results.  Basic  Linear  Algebra 
Subprograms  routines  (Dongarra  1993)  were  seen  as 
appropriate candidates. An MCAE “crash code”
arithmetic kernel was used as a vehicle for a basic
demonstration problem, based on earlier work (Diniz
et al. 2004).

This preliminary characterization of GPU acceleration
focused on a subset of the large space of numerical
algorithms, in this case factoring large sparse symmetric
indefinite matrices. Such problems often arise in
MCAE applications. The Intelligent Automation, Inc.
(ISI) team made use of the single precision general
matrix multiply algorithm.

Figure 1. Multicore factorization time, with and without
the GPU.


The GPU should also be a very attractive T&E
computation accelerator for overcoming hurdles such as
sparse matrix factorization. Previous generations of
accelerators, such as those designed by Floating Point
Systems (Charlesworth and Gustafson 1986), were for
the relatively small market of scientific and engineering
applications. Contrast this with GPUs that are
designed to improve the end-user experience in mass-
market arenas such as gaming.

To get meaningful speed-up in T&E settings, we
need to reduce the data transfer and interaction
between the host and the GPU to an acceptable
minimum. The T&E user should be warned that the
conduct of this analysis is not trivial, and the costs of it
must always be borne in mind when considering the use
of GPGPUs (Kepner 2004).
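The trade-off described above can be sketched with a simple timing model. The following is a minimal illustration, not the authors' analysis: it assumes a single n-by-n single precision matrix multiply, and the host rate, GPU rate, link bandwidth, and launch overhead are all hypothetical placeholder values chosen only to show the shape of the calculation.

```python
def offload_times(n, cpu_gflops=10.0, gpu_gflops=100.0,
                  link_gbytes_per_s=4.0, launch_overhead_s=1e-4):
    """Return (host_seconds, gpu_seconds) for one n-by-n matrix multiply.

    All default rates are hypothetical placeholders, not measured values.
    """
    flops = 2.0 * n ** 3            # multiply-adds in C = A * B
    bytes_moved = 3 * n * n * 4     # A and B to the GPU, C back, 4 bytes each
    host_seconds = flops / (cpu_gflops * 1e9)
    gpu_seconds = (launch_overhead_s
                   + bytes_moved / (link_gbytes_per_s * 1e9)
                   + flops / (gpu_gflops * 1e9))
    return host_seconds, gpu_seconds


def offload_pays(n, **rates):
    """True when the modeled GPU path, including transfers, beats the host."""
    host_seconds, gpu_seconds = offload_times(n, **rates)
    return gpu_seconds < host_seconds


for n in (32, 256, 2048):
    host_s, gpu_s = offload_times(n)
    print(f"n={n:5d}  host={host_s:.6f} s  gpu={gpu_s:.6f} s  "
          f"offload pays: {gpu_s < host_s}")
```

Under these assumed rates, small problems lose to the fixed overhead and transfer time, while large ones benefit because arithmetic grows as n cubed and data movement only as n squared. A slow enough link reverses the result even for large n.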

Implementation  research  results 

Results for recent runs on the C1060 from NVIDIA
are shown in Figure 1, which plots the time it takes to
factor the matrix as a function of the number of cores
employed, both with and without the GPU. ISI used a
dual-socket Nehalem host, sustaining 10.3 GFLOPS
when using one core, and 59.7 GFLOPS when using
all eight. When the GPU is employed, it performs
6.57E+12 operations, 92 percent of the total, and
sustains 98.1 GFLOPS in doing so. The code’s overall
performance with the GPU improves to 61.2
GFLOPS when one host core is used, and 79.8
GFLOPS with all eight. For perspective, reordering
and symbolic factorization take 7.9 seconds, permuting
the input matrix takes 2.64 seconds, and the triangular
solvers take 1.51 seconds (Lucas, Wagenbreth, and
Davis 2010).
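The reported rates can be turned into speedups and rough factorization times with a few lines of arithmetic. The sketch below uses only the figures quoted above; the implied times rest on the assumption that each sustained rate is an average over the whole numerical factorization, which the text does not state explicitly.

```python
# Figures quoted in the text.
gpu_ops = 6.57e12      # operations performed on the GPU
gpu_share = 0.92       # fraction of all operations done on the GPU
total_ops = gpu_ops / gpu_share

rates_gflops = {
    ("host only", 1): 10.3,
    ("host only", 8): 59.7,
    ("host + GPU", 1): 61.2,
    ("host + GPU", 8): 79.8,
}


def implied_seconds(gflops):
    """Time to perform total_ops at a sustained rate (an assumed average)."""
    return total_ops / (gflops * 1e9)


speedup_one_core = rates_gflops[("host + GPU", 1)] / rates_gflops[("host only", 1)]
speedup_eight_cores = rates_gflops[("host + GPU", 8)] / rates_gflops[("host only", 8)]

print(f"total operations: {total_ops:.3e}")
for (config, cores), gflops in rates_gflops.items():
    print(f"{config:10s} {cores} core(s): {implied_seconds(gflops):7.1f} s")
print(f"speedup with one host core:    {speedup_one_core:.2f}x")
print(f"speedup with eight host cores: {speedup_eight_cores:.2f}x")
```

Read this way, the GPU helps most when few host cores are available. With all eight cores busy, the overall gain shrinks to roughly a third.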


The single precision general matrix multiply function
used in this work was supplied by NVIDIA. In
testing, it was found that it could achieve close to 100
GFLOP/s, over 50 percent of the peak performance of
the NVIDIA GTS GPU. Thus, the efforts were
focused on optimizing the functions for eliminating
off-diagonal panels (GPUl) and factoring diagonal
blocks (GPUd).

Another application that may have T&E uses is a
fast and large-scale graph-based construct, e.g., the
route-planning algorithms found in complex urban
environment simulations. JSAF currently employs a heuristic
A* search algorithm to do route planning for its
millions of entities; the algorithm is sequential and
thus very computationally expensive. Using the GPU,
the JSAF simulation can off-load the route-planning
component to the GPU and remove one of its major
bottlenecks (Tran et al. 2008).
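For readers unfamiliar with A*, the following minimal sketch shows the general technique on a small grid. It is not JSAF's route planner: the grid representation, the unit step cost, the four-way movement, and the Manhattan-distance heuristic are all assumptions made for illustration.

```python
import heapq


def a_star(grid, start, goal):
    """Return the number of steps on a shortest path, or None if unreachable.

    grid is a list of equal-length strings; '#' marks a blocked cell.
    Movement is four-connected with unit cost (illustrative assumptions).
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates on a unit-cost grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            return cost
        if cost > best_cost[cell]:
            continue  # stale queue entry; a cheaper route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    heapq.heappush(
                        open_heap,
                        (new_cost + heuristic((nr, nc)), new_cost, (nr, nc)))
    return None


streets = ["....",
           ".##.",
           "...."]
print(a_star(streets, (0, 0), (2, 3)))
```

Each expansion depends on the priority queue state left by the previous one, which is why a single search is hard to parallelize. With many entities, however, the searches are independent of one another, and that is the kind of workload that can be handed to an accelerator.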

Early  experimentation  results  at  JFCOM 

T&E  users  may  benefit  from  an  awareness  of  the 
initial year of research on JFCOM’s GPU-enhanced
cluster,  Joshua.  It  was  marked  with  typical  issues  of 
stability,  operating  system  modifications,  optimization, 
and  experience.  All  of  the  major  stated  goals  of  the 
cluster  proposal  were  met  or  exceeded.  Joshua  easily 
met  its  stability  and  availability  requirements  from 
JFCOM. 

Any  potential  user  would  be  interested  in  the  issues 
of  getting  the  machine  up  and  running.  A  typical 
problem  was  getting  the  correct  operating  system 
installed  and  coordinating  that  with  the  NVIDIA 
staff’s recommendations as to varying versions and
incompatibilities.  Those  types  of  issues  are  still  relevant 
today. 

Joshua provided 24 × 7 × 365 enhanced, distributed,
and scalable computational resources that did
enable joint warfighters at JFCOM and international
partners to develop, explore, test, and validate
twenty-first century battlespace concepts. The specific goal was
to enhance global-scale, computer-generated military
experimentation by sustaining more than 2,000,000
entities on appropriate terrain with valid phenomenology.

This was more than achieved in a major breakthrough
in which 10 million entities were simulated in
a  Middle  Eastern  urban  environment  complete  with 
demographically  correct  civilians  (Figure  2). 

The tasks of overcoming implementation hurdles
and stabilizing the compute environment were interesting
but not daunting. Agent-based model combat
simulations of this size and sophistication were
previously impossible because of limitations of
computational power. The earlier pair of clusters had
enabled the development and implementation of a
proven scalable code base capable of using thousands of
nodes interactively. The ISI team continues to address
issues of interest to the T&E community such as
enhanced security for distributed autonomous processes,
interactive HPC paradigms, use of advanced
architectures, self-aware models, global terrain with
high-resolution insets, and physics-based phenomenology,
many of which have their counterparts in
T&E.

Figure 2. Screen capture of 10 million entity run.

There is a general consensus that there are two
possible ways to improve simulation fidelity: (a) by
increasing entity counts (quantitatively) and (b) by
increasing realism (qualitatively) of entity behaviors
and resolution of the environment. Numerous efforts
have been made to increase the former, e.g., SF
Express (Brunnet et al. 1998) and Noble Resolve. They
included the use of scalable parallel processors or
clusters of compute nodes (Wagenbreth et al. 2005).
As for the latter, JFCOM M&S teams have made
great strides in improving entity behavior models
(Ceranowicz et al. 2002; Ceranowicz, Torpey, and
Hines 2006) by adding intelligence to the simulation
entity behaviors; with these improvements, entities
behave in more realistic fashions. Because JFCOM has
been required to simulate more urban operations, the
density of the road and trail networks has dramatically
increased. This dictates an increase in computational
costs (in terms of how entities relate to the environment),
which was the heart of that research effort.

Power  consumption  analyses 

Given the great deal of interest in GPGPU
acceleration, the following work, while necessarily
preliminary because of the design dynamics of the
devices being offered, may prove useful to those facing
power issues today. In any case, these analyses
support the proposition that the use of GPGPUs is
probably a viable method for reducing
power consumption per unit of computation (usually
quantified here as FLOPS). Let us examine the extra
power requirement for a system, first at the maximum
power drain specified, then the drain at high
computational loads, the drain at idle, and finally
the drain with the GPGPU card removed from the
node.

Figure 3. Ammeter and harness used for current quantification.

ISI  had  access  to  three  versions  of  the  NVIDIA 
GPUs  that  were  tested,  the  8800,  9400,  and  9800.  The 
NVIDIA  C1060s  and  C2050s  were  not  available  for 
this  early  test.  Data  on  them  will  be  presented  when  it 
is  available.  In  each  case,  the  host  for  the  GPGPU  was 
chosen  to  best  complement  the  GPU  itself,  so  different 
platforms  were  used  in  every  instance.  While  this  may 
seem  to  be  comparing  apples  and  oranges,  this  is  a 
necessary  result  of  the  choice  of  the  target  GPUs  and 
would  be  more  convoluted  if  they  were  all  tried  on  one 
platform  with  the  concomitant  compromises. 

A  Model  22-602  Radio  Shack  AC  ammeter  probe, 
as  seen  in  Figure  3,  was  used  to  test  current  flow  to  the 
entire  node. 

Wattage  parameters  from  the  vendor  are  typically 
maximum  current  allowed,  not  typical  current  usage 
under  various  conditions.  That  is  why  the  authors 
measured  each  value  themselves.  All  values  in  this 
article  were  either  measured  or  calculated. 

In each case, the current to the node under test was
measured, within the accuracy of the meter, while
exercising the GPU (a) to the maximum
extent feasible, (b) at idle while running, (c) in a sleep
or hibernate mode, and (d) finally, with the
subject card removed. Cost, time, and instrumentation
constraints precluded measuring the entire power
consumption of the cluster Joshua, so figures for that
power consumption were derived from findings and
from data available from the vendors.

Table 1. Power readings using different GPGPUs
(whole node watts, ±4%).

GPU      Max    Idle   Sleep   Removed
8800     264    228    228     156
9400     444    360    324     275
9800     730    586    540     460

The authors wish to issue a caveat about the amperages
cited. They can reliably be used for
comparative purposes, but care should be exercised if
trying to calculate actual amperages to be experienced
in different computational environments and using
different analytic tools. The meter used could be
relied upon to return sound comparative figures,
but the absolute numbers might be off by some
significant fraction. Test and retest numbers were very
stable, giving some assurance that the comparative
values were meaningful. The question that was being
posed was: “How much power does the GPGPU card
consume in each of several different states and with
different host environments?” (Table 1). The details of
the hosts are omitted here for space considerations but
are available from the authors upon request.

These  data  indicate  that  the  entire  node  takes  on  the 
order  of  50  percent  more  power  at  full  load  and  that 
the  GPGPU  adds  on  the  order  of  15-20  percent  power 
consumption,  even  at  rest,  assuming  one  GPGPU  card 
per  processor.  For  T&E  purposes,  the  authors  would 
recommend  something  more  on  the  order  of  one 
GPGPU  per  four  to  eight  cores  of  a  CPU. 
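The readings in Table 1 can be reduced to the power attributable to the card itself by subtracting the "removed" reading from each of the other states. The sketch below does that arithmetic. It assumes the 9800 idle reading is 586 W (the printed digit is unclear), and it treats the whole difference as the card's draw, which ignores any change in host behavior when the card is absent.

```python
# Whole-node watts from Table 1; the 9800 idle value of 586 is an assumption.
node_watts = {
    "8800": {"max": 264, "idle": 228, "sleep": 228, "removed": 156},
    "9400": {"max": 444, "idle": 360, "sleep": 324, "removed": 275},
    "9800": {"max": 730, "idle": 586, "sleep": 540, "removed": 460},
}


def card_draw(gpu):
    """Watts attributed to the card in each state (node minus bare node)."""
    base = node_watts[gpu]["removed"]
    return {state: watts - base
            for state, watts in node_watts[gpu].items()
            if state != "removed"}


def percent_over_bare_node(gpu, state):
    """Extra node power in a given state, as a percentage of the bare node."""
    base = node_watts[gpu]["removed"]
    return 100.0 * (node_watts[gpu][state] - base) / base


for gpu in node_watts:
    draws = card_draw(gpu)
    shares = {s: round(percent_over_bare_node(gpu, s), 1) for s in draws}
    print(gpu, draws, shares)
```

Because the meter is trusted for comparisons rather than absolute values, the percentages are likely to be more dependable than the watt differences.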

GPGPU  Programming  in  CUDA 

Again, looking at the overall productivity issue,
programming ease may easily outweigh power consumption
and new hardware costs (Kepner 2004).
While we do not want to analyze CUDA programming
too stringently, the authors think it advisable to
show the potential user some indication of what
CUDA programming entails.

First, here is some FORTRAN code:

      do j = jl, jr
         do i = jr + 1, ld
            x = 0.0
            do k = jl, j - 1
               x = x + s(i, k) * s(k, j)
            end do
            s(i, j) = s(i, j) - x
         end do
      end do


Now, here is the same algorithm, implemented in
CUDA:

    ip = 0;
    for (j = jl; j <= jr; j++) {
      /* upper bound inferred from the ip stride two lines below */
      if (ltid <= (j - 1) - jl) {
        gpulskj(ip + ltid) = s[IDXS(jl + ltid, j)];
      }
      ip = ip + (j - 1) - jl + 1;
    }
    __syncthreads();
    for (i = jr + 1 + tid; i <= ld;
         i += GPUL_THREAD_COUNT) {
      for (j = jl; j <= jr; j++) {
        gpuls(j - jl, ltid) = s[IDXS(i, j)];
      }
      ip = 0;
      for (j = jl; j <= jr; j++) {
        x = 0.0f;
        for (k = jl; k <= (j - 1); k++) {
          x = x + gpuls(k - jl, ltid) *
                  gpulskj(ip);
          ip = ip + 1;
        }
        gpuls(j - jl, ltid) -= x;
      }
      for (j = jl; j <= jr; j++) {
        s[IDXS(i, j)] = gpuls(j - jl, ltid);
      }
    }
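One way to see why the loop nest maps onto GPU threads is to compare two orderings on the host. The sketch below is a plain Python reference, not the authors' code. It assumes zero-based inclusive bounds, a matrix stored as a list of rows, and a per-row buffer standing in for the GPU's local storage, and it checks that processing column by column gives the same result as processing each row independently.

```python
import random


def update_panel_column_order(s, jl, jr, ld):
    """Column-by-column order, mirroring the FORTRAN loop nest."""
    for j in range(jl, jr + 1):
        for i in range(jr + 1, ld + 1):
            x = 0.0
            for k in range(jl, j):
                x += s[i][k] * s[k][j]
            s[i][j] -= x


def update_panel_row_order(s, jl, jr, ld):
    """Row-by-row order: each row i could be handled by its own thread."""
    for i in range(jr + 1, ld + 1):
        row = [s[i][j] for j in range(jl, jr + 1)]   # local copy of the row
        for j in range(jl, jr + 1):
            x = 0.0
            for k in range(jl, j):
                x += row[k - jl] * s[k][j]
            row[j - jl] -= x
        for j in range(jl, jr + 1):
            s[i][j] = row[j - jl]


def random_matrix(n, seed):
    """Small dense test matrix; the size and seed are arbitrary choices."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n)] for _ in range(n)]


a = random_matrix(8, seed=0)
b = [row[:] for row in a]
update_panel_column_order(a, 1, 4, 7)
update_panel_row_order(b, 1, 4, 7)
print(max(abs(a[i][j] - b[i][j]) for i in range(8) for j in range(8)))
```

Rows below the panel read only the unchanged upper rows and their own earlier entries, so no row depends on another. That independence is what allows the work to be strided across threads.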

A critical factor, if not the most critical one, in
heterogeneous programming is the need to understand
which algorithms map well enough to the GPGPU to
warrant the overhead costs of porting and maintaining
them. For a more disciplined treatment of the
programming environment and approach that will be
useful, the reader is referred to the authors’ Web sites
on GPGPU processing (Davis 2009). NVIDIA also
offers course materials online, and the authors willingly
acknowledge the assistance that NVIDIA has given to
them. As with all tasks, there seems to be a critical
experience level required for reliable programming in
this mode.

Other approaches to better computation/watt ratios

ELM  moves  data  more  efficiently 

Many in the T&E community may be familiar with
computing pioneer Bill Dally. He has been advancing a
different approach to saving power during computation.
Analyzing the power used on microcircuits, his
team observed that most of the power was being used
moving data around the chip. Because many of these
movements were the nonoptimal artifacts of earlier
VLSI designs, he and his Stanford team set out to
make the data flows more power efficient (Dally et al.
2008).

Professor Dally’s Efficient Low-power Microprocessor (ELM)
project has sought high performance in the creation of a
low-power and programmable embedded system. He has sought to
reduce the very inefficient memory transfers by
designing a chip composed of many efficient tiles and
providing a full software stack. It is his intention that
ELM will be able to reduce or eliminate the need for
fixed function logic blocks in passively cooled systems.

The ELM team maintains that energy consumption
in modern processors is dominated by supplying
instructions and data to functional units. Because
interconnects benefit less than logic from advances in
semiconductor technologies, driving the interconnects
accounts for an increasing fraction of the energy
consumed. This may account for more than 70 percent
of the energy consumed by the computing unit.

Providing a platform that can execute real-time
computationally intensive tasks and still reduce the
power used is the goal of the ELM architecture. This is
being done in reaction to the fact that embedded
systems, e.g., cell phones, are composed of microprocessors
and fixed-function circuitry. Programmability
for the system is provided by the microprocessor, but it
is too inefficient to meet the computation, timing, and
power constraints of many communication and multimedia
protocols. This, in turn, requires fixed function
logic to be added to embedded systems to provide the
necessary performance. Unfortunately, this cannot be
changed once the system has been fabricated.

ELM implementations are designed so that software
replaces the fixed function hardware. This removes the
inefficiencies associated with this programmability
conundrum. This is advantageous because
software applications are more cost-effective to create
and update than silicon, and the concomitant power
savings are still realized.

Ensembles, which are simple tiles, are made up of
software managed memory (EM) and several Ensemble
Processors (EPs). Prof. Dally maintains that these small
tiles are much more energy efficient than large cores and
offer more computation contexts per unit of die area. The
team is developing the tiled architecture using software
to take advantage of the available computation resources.
The rationale here is that a larger software up-front
cost will be amortized over a program’s lifetime.

Each EP can issue both an arithmetic and a memory
operation using a two-wide instruction. Load latencies
are managed easily. Prefetching into the instruction
registers prior to execution eliminates stalling on jumps.
Some old parallelization techniques are used, e.g., the
ELM architecture supports single-instruction multiple
data execution within an ensemble. All EPs execute in
lock step with instructions coming from a single
instruction register file. This has effectively quadrupled the
number of instructions that can be stored.

There is a 64-entry, software-managed instruction
register file that is available to the EPs. The register
files are adequate to hold the inner loops of programs
with little performance degradation. Reduced energy
requirements are realized by having only one instruction
fetch per cycle per EP.

The Stanford team reports that there can be power
reductions of two orders of magnitude for individual
operations on the silicon. In Table 2, Dally’s team
presents their data on power reductions (Balfour et al.
2008).
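A few ratios drawn from the Table 2 figures make the scale of the reported savings concrete. The sketch below is illustrative arithmetic only. It assumes that the "irfs" row is the instruction register file and the "orfs" row a small operand register file, and the pairing of each ELM structure with a RISC counterpart is this sketch's own choice rather than a comparison made by the Stanford team.

```python
# Energy per access in picojoules and average power in milliwatts,
# taken from Table 2; the row interpretations are assumptions.
elm = {
    "avg_power_mw": 28.0,
    "instruction_register_read_pj": 16.0,   # "irfs" row
    "operand_register_read_pj": 1.3,        # "orfs" row
}
risc = {
    "avg_power_mw": 72.0,
    "instruction_cache_read_pj": 107.0,
    "data_cache_read_pj": 131.0,
}

avg_power_ratio = risc["avg_power_mw"] / elm["avg_power_mw"]
instruction_fetch_ratio = (risc["instruction_cache_read_pj"]
                           / elm["instruction_register_read_pj"])
operand_read_ratio = (risc["data_cache_read_pj"]
                      / elm["operand_register_read_pj"])

print(f"average power:     {avg_power_ratio:.2f}x lower")
print(f"instruction fetch: {instruction_fetch_ratio:.2f}x lower")
print(f"operand read:      {operand_read_ratio:.1f}x lower")
```

On these pairings, the hundredfold figure appears only when a data cache read is compared with a read from the smallest register file. The average power of the two processors differs by a much smaller factor.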

This approach shows much promise but may not be
immediately applicable to the T&E community and
may be encumbered by the as yet undemonstrated
capability of journeymen programmers to master the
analytical techniques required for optimization. Further,
the authors were not able to find any data that
supported an analysis of overall power savings. In an
analogous way, there is a temptation for GPGPU
advocates to claim huge processing speedups for some
restricted subroutine, but they are less inclined to say
what the impact was on the total functioning code base
that is actually needed by the user.

Table 2. Power savings using ELM.

Ensemble Processor
  Technology      TSMC CL013G (Vdd = 1.2 V)
  Clock freq.     200 MHz
  Avg. power      28 mW
  Multipliers     16-bit + 40-bit acc.    16.5 pJ/op
  irfs            64 128-bit registers    16 pJ/read     18 pJ/write
  xrfs            32 32-bit registers     14 pJ/read     8.7 pJ/write
  orfs            8 32-bit registers      1.3 pJ/read    1.8 pJ/write
  arfs            8 16-bit registers      1.1 pJ/read    1.6 pJ/write
  Memory          8 KB                    33 pJ/read     29 pJ/write

RISC Processor
  Technology      TSMC CL013G (Vdd = 1.2 V)
  Clock freq.     200 MHz
  Avg. power      72 mW
  Multiplier      16-bit + 40-bit acc.    16.5 pJ/op
  Register file   40 32-bit registers     17 pJ/read     22 pJ/write
  Instr. cache    8 KB (two-way)          107 pJ/read    121 pJ/write
  Data cache      8 KB (two-way)          131 pJ/read    121 pJ/write

IBM’s Blue Gene

IBM is also contributing to power reduction
technologies in the form of the “big-iron” Blue Gene
series of high performance computers. For its Blue Gene
initiative, IBM integrated all of the putatively essential
subsystems on a single chip, with each of the
computational or communications nodes dissipating
low power (about 17 W, including DRAMs). Low
power dissipation enables the installation of as many as
1,024 compute nodes and the necessary communications
nodes in the standard computer rack. This can be done
in accordance with standard limits on electrical power
supply and air cooling. As discussed earlier, the
important performance metrics in terms of power
(FLOPS per watt), space (FLOPS per square meter of
floor space), and cost (FLOPS per dollar) have allowed
IBM to scale up to very high performance (Chiu, Gupta,
and Royyuru 2005). The issue may be, “Was this done at
the expense of general purpose accessibility?”
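The rack-level consequence of the per-node figure is simple arithmetic, sketched below. It counts only the 1,024 compute nodes at the quoted 17 W each. The communications nodes, power supplies, and cooling are not included because the text gives no figures for them, so the result is a lower bound under that assumption.

```python
WATTS_PER_NODE = 17            # quoted figure, including DRAMs
COMPUTE_NODES_PER_RACK = 1024  # quoted maximum per standard rack


def rack_compute_power_kw(nodes=COMPUTE_NODES_PER_RACK,
                          watts_per_node=WATTS_PER_NODE):
    """Power drawn by the compute nodes alone, in kilowatts."""
    return nodes * watts_per_node / 1000.0


print(f"{rack_compute_power_kw():.1f} kW for compute nodes in one rack")
```

A figure in this range is what makes air cooling of a fully populated rack plausible.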

This is not a classical “general purpose” computer
because it requires significant esoteric skills to make
optimal use of its power. The compute nodes are
attached to three parallel communications networks:
peer-to-peer communications use a three-dimensional
toroidal network, collective communications use a
collective network, and fast barriers use a global interrupt
network. External communications are provided by
an Ethernet network. File system operations are
handled by the I/O nodes on behalf of the compute
nodes. Finally, there is a management net to provide
access to the nodes for configuration, booting, and
diagnostics.

The compute nodes in Blue Gene/L support a single
user program using a minimal operating system. A
limited number of POSIX calls are supported, and only
one process may be run at a time. Green threads must be
implemented to simulate local concurrency. C, C++, and
FORTRAN are the supported languages, and, as is
common with clusters, MPI is used for communication.

The Blue Gene/L system can be partitioned into
electronically isolated sets of nodes to allow multiple
programs to run concurrently. The major drawbacks
seem to be that the hardware is not based on a
commercially supported product, as are the Cell
processor implementations and the GPGPU accelerations,
and that the programming environment is potentially
problematic.

Analysis 

Out of scientific restraint, the authors have assiduously
resisted the temptation to claim huge increases in
computational power or efficiencies in power consumption
per unit computation. They note that while
the NVIDIA processors in the 8800 through the
C2050 series may have potential compute power that is
nominally in the several hundred gigaFLOPS range,
the issue of real interest is, “What will they do to
accelerate the programs the T&E user needs?” In the
authors’ case, early experiences on the simulations run
by JFCOM speak to the evaluation segment of T&E
because that is a major thrust at JFCOM.

The GPGPUs can attack some issues, most notably
the spikes of activity occasioned by a data surge by the
sensor being simulated or a new direction of travel for a
large group. These spikes are tailor-made for resolution
by GPU processing, bearing close resemblance to the
visualization algorithms for which the GPU was
designed. By easily handling the visualization (Lucas
et al. 2007) and route-finding spikes (Tran et al. 2008),
the GPGPUs do actually provide an overall
doubling of effective computing for the cost of an
approximately 30 percent increase in power. Clearly
this is desirable at this level, and considering the
newness of the approach, more impressive gains might
be anticipated later.

In the case of Joshua, one GPGPU for every eight
cores was considered prudent, and experience has
shown that the GPGPUs have been sufficient to
meet the needs imposed upon them. In this case, the
power increase is more on the order of 5 percent, with
the anticipated doubling of computational power.
Should this ratio turn out to be valid in other, more
constrained implementations, as described earlier, the
benefits will be significant. Increased habitability,
reduced heat signatures, increased battery life, reduced
environmental stress on electronic components, and
other benefits would accrue with almost trivial energy
costs.
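The two cases above can be compared on a computation-per-watt basis with one line of arithmetic. The sketch below takes the doubling of effective computing and the two power increases (about 30 percent and about 5 percent) directly from the text. The function name is illustrative, and the result is only as good as those round figures.

```python
def compute_per_watt_gain(compute_factor, power_increase_fraction):
    """Ratio of new to old computation per watt.

    compute_factor: new effective computing relative to the baseline.
    power_increase_fraction: added power as a fraction of the baseline.
    """
    return compute_factor / (1.0 + power_increase_fraction)


one_card_per_processor = compute_per_watt_gain(2.0, 0.30)
one_card_per_eight_cores = compute_per_watt_gain(2.0, 0.05)

print(f"one GPGPU per processor:   {one_card_per_processor:.2f}x per watt")
print(f"one GPGPU per eight cores: {one_card_per_eight_cores:.2f}x per watt")
```

If the doubling holds at the sparser ratio of cards to cores, most of the gain in computation carries through to the per-watt figure.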

Critically, the computing power that the T&E
professionals  need  would  be  made  available  to  them 
where  they  need  it,  on  the  range  or  in  the  field.  This  is 
not  to  say  that  the  authors  find  that  other  approaches 
to  heterogeneous  high  performance  computing  may 
not  also  hold  promise.  As  with  all  new  technologies, 
the  costs  in  terms  of  availability,  adoptability,  and 
training  must  be  kept  in  mind. 

In more mundane settings, say a domestic computing
center, the cost savings in power alone are
significant. While the numbers on power usage for
large clusters such as Joshua are merely daunting in
Virginia, in more remote areas such as the Maui High
Performance Computing Center, which faces
electric rates that are literally multiples of what is
common on the mainland, it is reasonable to look at
the doubling of computational power as vital. It means
that one’s FLOPS per watt improvements may
generate savings on the order of $2,500 per hour
to $5,000 per hour, at $0.09 and $0.20 per kilowatt
hour, respectively, for the two centers.
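The dependence of savings on the local electric rate can be sketched as follows. This is an illustrative model, not the authors' calculation. It assumes that doubling computation without accelerators would double the electrical load, while doing so with GPGPUs adds only a small fraction. The 100 kW baseline load is a hypothetical value, and only the two rates and the 5 percent figure come from the text.

```python
def hourly_saving(baseline_kw, dollars_per_kwh, gpu_power_fraction):
    """Dollars per hour saved by doubling computation with GPGPUs.

    Compares a doubled load (2 * baseline) against baseline plus the
    fractional increase attributed to the accelerators.
    """
    load_without_gpus = 2.0 * baseline_kw
    load_with_gpus = (1.0 + gpu_power_fraction) * baseline_kw
    return (load_without_gpus - load_with_gpus) * dollars_per_kwh


BASELINE_KW = 100.0   # hypothetical cluster load, for illustration only

for label, rate in (("$0.09 per kWh", 0.09), ("$0.20 per kWh", 0.20)):
    saving = hourly_saving(BASELINE_KW, rate, 0.05)
    print(f"{label}: ${saving:.2f} per hour saved")
```

The saving scales linearly with both the baseline load and the rate, so the same hardware change is worth a little more than twice as much at the higher rate.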

Conclusions 

T&E  will  face  increasing  demands  for  ever-growing 
computer  systems.  Many  new  technologies  offer 
various  paths  to  increasing  computational  power,  while 
restraining  the  numerous  and  varied  costs  of  power 
consumption.  The  authors  maintain  that  even  their 
conservative  approach  and  carefully  substantiated 
claims  support  the  tenet  that  heterogeneous  computing 
displays many attractive features of interest to the
T&E community.